Data Loads

Michelle’s PEG Data

How Correlated Are These Data?


Michelle: For a given search term, the correlation between the Detroit Metro data and the all-of-MI data was very high. Is this because the area considered Detroit Metro is very large (>4 million; do we know the size?), thus representing over 40% of the state population and possibly an even larger percentage of those with internet access? This correlation was higher than the correlation between the different search terms within a single area.

Jules: We’d expect a high correlation between the “Michigan” search term and the equivalent “Detroit” search term because a) the “Detroit” item is part of the “Michigan” whole and b) Detroit has a high concentration of Michigan’s total population. We see that with MI_SF to DET_SF = 0.86 and MI_Noro to DET_Noro = 0.85. I imagine the other correlations are still relatively high due to the general symptom & seasonal overlap between the two search terms.
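The part-of-the-whole argument above can be illustrated with a minimal sketch in plain Python. The `pearson` helper and the three series below are hypothetical, made-up values for illustration only; they are not the report's search data and do not reproduce the 0.86 or 0.85 figures. The sketch assumes the statewide series is simply the Detroit series plus a rest-of-state series.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly search-interest values (illustration only).
detroit = [10, 30, 55, 40, 20, 15]
rest_of_state = [12, 18, 35, 45, 30, 10]

# Assumption: the statewide series is the sum of its two parts.
state = [d + r for d, r in zip(detroit, rest_of_state)]

r_part_whole = pearson(detroit, state)         # Detroit vs. all of Michigan
r_part_rest = pearson(detroit, rest_of_state)  # Detroit vs. everything else

print(f"Detroit vs. state:         {r_part_whole:.2f}")
print(f"Detroit vs. rest of state: {r_part_rest:.2f}")
```

Because the Detroit series is itself a component of the statewide series, the part-to-whole correlation comes out higher than the part-to-rest correlation, and the gap grows as Detroit's share of the total grows.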


Norovirus Influent Wastewater Testing


Michelle: The Avg concentration values for a given WWTP don’t correlate well with the PMMoV normalized values for the same WWTP, whereas I saw decent correlation values when I ran this. Something seems off with the Avg concentration values?

Jules: It looks like it has something to do with the “oldest” data that was just added to the system. Something is definitely different there. The easiest way to troubleshoot it would probably be for me to look at the dataset Michelle was using.


How Correlated Are These Data?

How Correlated Are Search Data and WWTP Data?


Michelle: For the same dates, the highest correlations between PMMoV normalized values and search terms were seen for YC and FL, and the lowest for TM and JS. TM and JS had our highest peaks overall; perhaps this affected the correlation. Can we separate the charts that compare the PMMoV normalized values to the search terms from the charts for the Avg concentration values (see slides 6 and 7)? Overall, stomach flu searches correlated more highly with the WW values than norovirus searches did (on the same day).


Timeseries Comparisons

##  [1] "Date"                 "AA_NorovirusAvgConc"  "AA_NVnormalizedPMMoV"
##  [4] "FL_NorovirusAvgConc"  "FL_NVnormalizedPMMoV" "JS_NorovirusAvgConc" 
##  [7] "JS_NVnormalizedPMMoV" "TM_NorovirusAvgConc"  "TM_NVnormalizedPMMoV"
## [10] "YC_NorovirusAvgConc"  "YC_NVnormalizedPMMoV" "MI_Noro"             
## [13] "MI_SF"                "DET_Noro"             "DET_SF"              
## [16] "DET_GE"               "MI_GE"

Does Lagging the Data Affect the Correlation Between Search Data and WWTP Data?

## [1] "Correlation Scatter Plots, Lag = 7"

## [1] "Correlation Scatter Plots, Lag = 14"

## [1] "Correlation Scatter Plots, Lag = 21"

## [1] "Correlation Scatter Plots, Lag = 7"

## [1] "Correlation Scatter Plots, Lag = 14"

## [1] "Correlation Scatter Plots, Lag = 21"


Michelle: For the cross-correlations, the labels said that the search data lagged compared to the WW data. The results in slide 9 show that values correlate best when the NV search term is 7 to 21 days (14 for YC) earlier than the wastewater values for Detroit (-21 to -7 lag). Am I understanding this wrong? If we look at the graph in slide 8, NV seems to be coming up later than the WW values for YC. Separately, values correlate best when the Detroit metro SF search term is on the same day as the WW values (except TM); more variation is seen in the MI SF search timing.

Jules: The data here has the Search data lagged relative to the WW data. The WW dates are held constant and the “lag” applies to the search data, so a “-7 day lag” means that the WW date was held fixed [ex. 11/21/2022] and assigned the search data from seven days later [11/28/2022].

Updated 11/30/2022: The WW data is lagged relative to the Search data. The Search dates are held constant and the “lag” applies to the WW data, so a “-7 day lag” means that the Search date was held fixed [ex. 11/21/2022] and assigned the WW data from seven days later [11/28/2022].
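To make the sign convention concrete, here is a minimal sketch in plain Python of the updated (11/30/2022) convention: search dates stay fixed, and a lag of -7 pairs each search date with the WW value from seven days later. The function name `pair_with_lag`, the dictionary layout, and the numeric values are hypothetical and for illustration only; this is not the code that produced the plots.

```python
from datetime import date, timedelta


def pair_with_lag(search, ww, lag_days):
    """Pair each search date with the WW value shifted by lag_days.

    Search dates are held fixed. A lag of -7 pairs a search date with the
    WW value from seven days later; a lag of +7 pairs it with the WW value
    from seven days earlier. Dates with no matching WW value are skipped.
    """
    pairs = []
    for day in sorted(search):
        ww_day = day - timedelta(days=lag_days)  # lag of -7 gives day + 7
        if ww_day in ww:
            pairs.append((day, search[day], ww[ww_day]))
    return pairs


# Hypothetical values keyed by date (illustration only).
search = {date(2022, 11, 21): 40, date(2022, 11, 28): 55}
ww = {date(2022, 11, 21): 1.2, date(2022, 11, 28): 2.5}

# The 11/21/2022 search value is paired with the 11/28/2022 WW value.
pairs = pair_with_lag(search, ww, -7)
for search_day, search_value, ww_value in pairs:
    print(search_day, search_value, ww_value)
```

Under this convention, a best correlation at a negative lag means the search series is being matched to later WW values, that is, search activity leads the wastewater signal.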